Mon. Not. R. Astron. Soc. 000, [l]{T2] (2012) Printed 2 April 2013 (MN WT^ style file v2.2) 



An improved quasar detection method in EROS-2 and 
MACHO LMC datasets 



o 

(N 



K. Pichara,^'^ P. Protopapas,^'^ D.-W.Kim,^'^ J.-B. Marquette,'* and P. Tisserand^ 

^ Computer Science Department, Pontificia Universidad Catolica de Chile, Santiago, Chile. 

^ Harvard- Smithsonian Center for Astrophysics, Cambridge, MA, USA 

^Institute for Applied Computational Science, Harvard University, Cambridge, MA, USA 

"-UPMC-CNRS, UMR7095, Institut d'Astrophysigue de Paris, F-75014, Paris, France 

^Research School of Astronomy and Astrophysics, Australian National University, Canberra, Australia 



ABSTRACT 



6 



O 

o 

^' 

o 
en 

• • 1 INTRODUCTION 



> 

X 



We present a new classification method for quasar identification in tlie EROS-2 
and MACHO datasets based on a boosted version of Random Forest classifier. We 
use a set of variability features including parameters of a continuous auto regressive 
model. We prove that continuous auto regressive parameters are very important dis- 
criminators in the classification process. We create two training sets (one for EROS-2 
and one for MACHO datasets) using known quasars found in the LMC. Our model's 
accuracy in both EROS-2 and MACHO training sets is about 90% precision and 86% 
recall, improving the state of the art models accuracy in quasar detection. We ap- 
ply the model on the complete, including 28 million objects, EROS-2 and MACHO 
LMC datasets, finding 1160 and 2551 candidates respectively. To further validate our 
list of candidates, we crossmatched our list with a previous 663 known strong candi- 
dates, getting 74% of matches for MACHO and 40% in EROS. The main difference 
on matching level is because EROS-2 is a slightly shallower survey which translates 
to significantly lower signal-to-noise ratio lightcurves. 
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Given the immense amount of data being produced by cur- 
rent deep-sky surveys such as Pan-STARRS (Kaiser et al 



2002J ), and f uture surveys such a s LSST ( ,Matter„2007[ ) and 



SkyMapper (Keller et al. 20071, astronomy is facing new 
challenges on how to analyze big data and thus on how to 
search or predict events/patterns of interest. 

The size of the data has already exceeded the capability 
of manual examination or the capability of standard data 
analysis tools. LSST will produce 15 terabytes of data per 
night, which is even beyond the capacity of typical data 
storage today. 

Thus in order to analyze such a huge amounts of data 
and detect interesting events or patterns with minimum false 
positives, innovative and novel data analysis methods are 
crucial for the success of such surveys. 

In our previous works ( Kim et al.||2011a 20121 we de- 
veloped classification models for the selection of quasars 
from large photometric databases using variability charac- 
teristics as the main discriminators. In particular we used a 
supervised classification model trained using a set of vari- 
ability features calculated from MACHO lightcurves (Al 



cock|2000 l. We applied the trained model to the entire MA- 
CHO database consisting of ~40 million lightcurves and se- 
lected few thousands of quasar candidates. In this paper, 
we present an improved classification model used to de- 
tect quasars on MACH O ( |Alcock|200"ol ) and EROS-2 dataset 
( jTisserand et al.|2007| ). The new model which works over an 
extended set of variability features, substantially decreases 
false positive rate and increases efficiency. 

The actual model improvement is a result of an im- 
provement in the machine learning classification model and 
the lightcurve features we use. Machine learning classifica- 
tion methods have been very popular for many decades. 
These methods are data analysis models that learn to pre- 
dict a categorical variable from a set of other variables (of 
any type). Most known classification models are: decision 
trees (| Quinlan||l993|), naive Bayes |Duda fc Hart]|1973[ ), 



Neural Netwo rks ([Rumelhart et al.| 19861, Support Vec- 
tor Machines (Cortes & Vapnik 1995| and Random For 
est ( Breiman 2001J I. There are some meta-models to im- 
prove classification results like Boosting m ethods |Freund| 
fc Schapire] ( |1997[ ) and Mixtures of Experts ( |Jordan||1994f 7 
among others. In general more recent classifiers are a result 
of research focused on building models able to search for 
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patterns within high dimensional datasets, where the com- 
binatorial number of possible projections of data is large. 

Many machine learning classifiers have been applied to 
the analysis of astronomical data in particular to classify 
transients and variable stars from time series data ( Bloom| 



et al. 20111 Richards et al. 20TT1 Bloom & Richards" 2011 



Debosscher et al. 



200^ 



2010[|Kim et al.||2011a|b[ ). ( IWangeTaT 



Wachman et al. 



2009, Wang et al 



2010 1 proposed an 



algorithm to fit phase-shifted periodic time series using a 
mixture of Gaussian processes. ( Debosscher et al.|2007 l used 
many machine learning classifiers to learn a model that clas- 
sifies variable stars in a sample from Hipparcos and OGLE 
databases. ( Richards et al.|2011 | used Random Forest clas- 
sifier to classify between pulsational variables and eclipsing 
systems used in Milky Way tomography. In [Bloom et al.| 
(20111 they used machine learning algorithms to classify 



transients and variable stars from the Palomar Transient 
Factor y (PTF) survey ( |Rau et al.|2009[ ). In [Wachman et al 



(20091 they used cross correlation as a phase invariant fea- 



ture to be used as a similarity indicator in a kernel function. 

In this work we used a Random Forest classifier 
(Breiman|2001 1 boosted with the AdaBoost algorithm (Fre 



und & Schapire||1997l. The Random Forest classifier comes 



from the well-known decision tree model ( Quinlan 1993 1 and 
Baggging techniques ( Breiman|1996 l, where the model ran- 
domly explores several subsets of features while analyzing 
samples of training data. This model performs very well in 
many machine learning domains ( [Breiman 2001). AdaBoost 
algorithm ( Freund fc Schapire|1997 l is a boosting technique 
which fits a sequence of classification models (in this case 
a sequence of many Random Forests) to different subsets 
of data objects (in our case lightcurves), generating a mix- 
ture of classifiers each one specialized in smaller areas of 
the feature space. We call these classifiers as "weak clas- 
sifiers" or "simpler classifiers" . This is a nice property for 
quasar classification, given that there are only a few known 
training quasars compared with the amount of non-quasars 
lightcurves. Having some weak classifiers that take care of 
some areas with no training quasars helps to filter out many 
non-quasars, while other specialized classifiers perform well 
near the quasar areas in the feature space. 

Besides improving the classification model, we added 
new features as descriptors of lightcurves. These features 
correspond to the parameters of the continuous auto re- 
gressive (CAR(l)) model ( [Belcher et"al]|199 4') fitted to the 
lightcurves. Previous work shows that describing quasars 
using CAR(l) fitting parameters gives suitable results to 



differentiate them from other classes of lightcurves (Kelly 
et al.|2009l. In this work they did not use machine learning 



classifiers to automatically detect quasars, they use CAR(l) 
model to fit 100 quasars lightcurves in order to find corre- 
lations between CAR(l) parameters and luminosity charac- 
teristics. 

In our work we show that by adding CAR(l) features to 



our previous set of features (used in Kim et al. (2011al), we 



can learn more accurate models for quasar detection. Given 
that our model is built to find quasars over dozens of mil- 
lions of stars, we need to be very efficient in the estimation 
of the optimal parameters in order to make the process fea- 
sible within a considerable amount of time. Unfortunately, 
methods such as Metropolis Hastings or Gibbs Sampling are 



not suitable for our purposes, given the computational cost 
they involve. 

To gain efficiency, we reduce the problem by approxi- 
mating one of the parameters (the mean value of the light 
curve) and optimizing the remaining parameters (the am- 
plitude and time scale of the variability) using a multi- 



dimensional unconstrained nonlinear minimization (Nelder 
& Mead 1965!). Once we get the optimal parameters we 



use them as features of the object corresponding to the 
lightcurve. Besides the CAR(l) features we also used time 
series features as in our previous work ( Kim et al.|2011b l, in 
section|4]we give details about all the features we extracted. 
To check the fitting accuracy of our model we first 
calculate the training accuracy of our classifier using 10-fold 
cross validation over a training set, which consists of about 
six thousand known light curves corresponding to different 
kinds of variable stars, non-variable stars and confirmed 
quasars; one set corresponding to the MACHO database 
and another to the EROS-2 database. In the MACHO case 
we substantially improve our training accuracy compared 
with our previous work (Kim et al. 2011bl, increasing 



14.3% in precision and 3.6% in recall for the MACHO 
database. In EROS-2 training database, we get about the 
same training efficiency as in the MACHO case but we 
could not compare to our previous work because this is the 
first time we attempt to classify in EROS-2 database. As 
an extra test for our candidates, we crossmatch them with 
the previous set of strong candidates found in |Kim et al.| 
(2011b I, details are presented in section [s] 



Using parallel computing we decrease the processing 
time to allow us to select quasar candidates from the en- 
tire database within three days. Note that the data analysis 
schema used in this work can be applied to any of the ongo- 
ing and future synoptic sky surveys such as Pan-STARRS, 
LSST, and SkyMapper, among others, r] 

If confirmed the selected quasars from the MACHO 
database will provide critical information for galaxy evolu- 
tion, black hole growth, large scale structure, etc. (|Heckman| 



et al. 2004] [Bower et aLl|2006[ [Trichas et al][2069[ |2010| 



Moreover the resulting quasar lightcurves will be a valuable 
dataset for quasar time variability studies, (e.g. time scale, 
blackhole mass, type i and ii variability) since MACHO and 
EROS lightcurves are well-sampled over 7.4 years ( [Alcock 
[2000). 

The paper is organized as follows, in section [2] we 
present details about EROS-2 database, in section [S] we de- 
scribe in details the classification model we use, including 
the Random Forest Model and AdaBoost, in section 111 we 
describe the features we use to describe the lightcurves, in 
section [5] we describe the experimental results for the MA- 
CHO and EROS-2 dataset. 



2 EROS-2 DATASET 

The EROS-2 collaboration made use of the MARLY tele- 
scope, a one meter diameter Ritchey-Chretien (//5.14) in- 
strument dedicated to the survey. It was operated between 



^ Our main computer resource is [The O dyssey cluster| supported 
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July 1996 and March 2003 at La Silla Observatory (ESO, 
Chile). It was equipped with two wide angle CCD cam- 
eras which are located behind a dichroic beam-splitter. 
Each camera is a mosaic of 8 CCDs, 2 along right ascen- 
sion and 4 along declination. Each CCD has 2048 x 2048 
pixels of 15 X 15 ^rr? individual size, corresponding to a 
0.6 X 0.6 arcsec^ pixel surface on the sky. The size of the 
field of view is 0.7° along right ascension and 1.4° along 
declination. The dichroic beam-splitter allowed simultane- 
ous imaging in two broad non-standard passbands, Be in 
the range 4200-7200 (the so-called "blue" channel), and Re 
in the range 6200-9200 (the so-called "red" channel). The 
blue filter is intermediate between the standard V and R 
standard passbands, while the red filter is analogous to Ic- 
The normalized transmission c urve of these f i lters, compared 
to standard ones, is given by Hamadache ( 2004| P on Fig. 
3.3. Tisserand et al. (20071 give in Eq. (4) the equations to 
transform EROS-2 magnitudes into V and /^ ones within an 
accuracy of 0.1 magnitude. 

The light curves of individual stars were constructed 
from fixed positions on templates using PEIDA, a software 
specifically developed for the photometry of EROS 2 images 
I Ansari|1 996). The nomenclature of objects is defined as in 
Derueet al.| ( |2002| . 



3 METHODOLOGY 

To train a model that learns to detect quasars, we propose 
to use a combination of classifiers. Combination of multi- 
ple classifiers was first proposed by Xu et al. (1992 1. In 



that work, they proved that combining multiple classifiers 
overcome many of the individual classifiers limitations. In 
many pattern recognition problems, such as character recog- 
nition, handwritten text recognition and face recognition 
(Zhao et al. 2003 Plamondon & Srihari 20001, combina- 



tion of multiple classifiers obtain much better classification 
performance. One effective way to combine classifiers is the 
AdaBoost algorithm, proposed in Freund & Schapire (19971). 

The AdaBoost algorithm consists of a set of base clas- 
sifiers that are trained sequentially, such that each classifier 
is trained on the instances where the previous classifier ob- 
tained a bad performance (learn what your partners could 
not learn). In [Freund fc Schapire] ( |1997[ ), they show that if 
the training set used for each classifier depends on the good- 
ness of fit of the previous classifier, then the performance of 
the whole system improves. To make that the base classi- 
fiers focus on different subsets of the training set, we assign 
weights to training data instances. The lower the weight for 
an instance, the less the classifier focuses on it (see section 
3.1 for further details). 

One of the advantages of boosting methods is that af- 
ter the model fitting phase is completed, each of the base 
classifiers become an expert in some subset of data objects. 
This is one of the main reasons that motivate us to use a 
previous boosting step. Given that we have a very small 
amount of known quasars in our training set compared with 
the amount of non quasars, training a set of base classifiers 
that just learn how to filter out some of the non quasars 



would be very helpful for the next base classifier used in the 
sequential process. We now present a detailed description 
of the boosting method we use in this work, the AdaBoost 
algorithm (Freund fc Schapire|1997l. 



3.1 AdaBoost Algorithm 

AdaBoost, short for adaptive boosting, is a machine learn- 



ing algorithm proposed by Freund and Schapire (Freund 



& Schapire 19971. It is a meta-algorithm because it com- 



bines many learning algorithms to perform classification. 
AdaBoost is adaptive in the sense that subsequent classi- 
fiers built are tweaked in favor of those instances misclassi- 
fied by previous classifiers. Although AdaBoost is sensitive 
to noisy data and outliers, it is less susceptible to overfitting 
( Dietterich|1995 1 than most learning algorithms. 

In the context of lightcurve-classification, suppose we 
have a training (labeled) set of n lightcurves and q features 
describing each lightcurve. Each lightcurve in the training 
set has a known given label (e.g. quasar or non-quasar). Let 
[xi , . . . , x„] be a set of n descriptors where each x^ i G 
[1 . . . n] is a vector associated to the lightcurve i where its 
descriptor (features) values are {xn, . . . , Xiq} where q is the 
number of features. Let {yi, . . . , j/„} be the labels such that 
T/i = 1 if the lightcurve i is a quasar and i/i = —1 otherwise. 

Let H be the set of m classifiers {hi, ... , hm}, where hi : 
X ^ Y and D^^' be the distribution of weights on classifiers 
at iteration t. Define m to be the number of classifiers and 
a constant T to be the number of times to iterate in the 
AdaBoost algorithm. 



Initialization: 

X =[xi,X2,. . . ,X„] 

Y = [yi,y2,...,yn] 

T 4,n 

Algorithm: 

for t = 1 to T do 
for j = 1 to m do 

end for 

et := minej 

if et ^ 0.5 then 

break 
end if 

ht := argmin{ej} 

at := |ln((l -et)/et) 

for j = 1 to n do 

^(t+i) _ ^W exp(-at y, ht{x,))/Zt 

end for 
end for 
H(X) := [H{xi),H{x2), ■ ■ ■ ,H{x„)], such that 



'H{xi) = sign I ^ at ht{xi 



Available at URL: http://tel.archives-ouvertes.fr 



Notes: 



4 K. Pichara, P. Protopapas, D.-W. Kim, J.-B. Marquette and P. Tisserand 



• 5ij is the Kronecker delta. 

• Zt is a, normalization factor 



^t = ^ df ' exp(-at t/i ht{xi) 



voted class among the set of P decision trees (see Breiman 



(2001 1 for more details). In Breiman (2001 1 they show that 



as the number of trees tend to infinity the classification er- 
ror of the RF becomes bounded and the classifier does not 
overfit the data. 



The equation to update the classifier weight distribution is 
constructed so that —ayiht{xi) < 1 when yi = ht{xi) and 
—ayiht{xi) > 1 when yt ^ ht{xi). Thus, after selecting an 
optimal classifier /it, for the distribution Dt, the objects Xi 
that classifier ht classified correctly are given less weight and 
those that it identified incorrectly are given more weight. 
Hence, when the algorithm proceeds to test the classifiers 
on D'*"*"^', it is more likely to select a classifier that better 
classifies the objects that ht missed. Adaboost minimizes 
the training error (exponentially fast) if each weak classier 
performs better than random guessing (et < 0.5). 

The base classifier we used in this work is the Random 
Forest classifier ( |Breiman|2001[ ) , a very strong classifier that 
has shown very good results in many different domains. The 
following section shows details about the Random Forest 
classifier. 



3.2 Random Forest Classifier 

Random Forests (RF) is a popular and very efficient 



1993 



1996 



algorithm based on decision tree models (Quinlan 
and Bagging for classification problems ( [Breiman 
20011. It belongs to the family of ensemble methods, 
appearing in machine learning literature at the end of 
nineties (|Dietterich|2000| and has been used recently in the 



astronomical journals (Carliles et al. 2010 Richards et al 



2011 ). The process of training or building a Random Forest 



given training data is as follows: 

• Let P be the number of trees in the Forest and F the 
number of features on each tree, both values are model pa- 
rameters. 

• Build P sets of n samples taken with replacement from 
the training set; this is called bagging. Note that each of the 
P bags has the same number of elements from the training 
set but less different examples, given that the samples are 
taken with replacement. 

• For each of the P sets, train a decision tree using a ran- 
dom sample of F features from the set of q possible features. 

The Random Forest classifier creates many linear sepa- 
rators inside many feature-subsets until it gets suitable sep- 
arations between objects from different classes. Linear sep- 
arations come from each decision tree, each of the feature- 
subsets come from the random feature selection process on 
each tree. The bagging procedure is very useful to estimate 
the error of the classifier during the training process. This er- 
ror can be estimated using out-of-the-bag procedure, which 
means, "evaluate the performance of each tree using the ob- 
jects not selected in the bag which belong to the tree" (see 
[Breiman (2001J for further details). 

After training the Random Forest, to classify a new un- 
known lightcurve descriptor, one uses each of the decision 
trees already trained with the Random Forest to classify 
the new unknown instance and the final decision is the most 



4 FEATURE EXTRACTION 

We extracted 14 features per each band for each lightcurve. 
Those features correspond to 11 time series features used in 
our previous work ( [Kim et aL]|2011b| ) and 3 features corre- 
sponding to the CAR(l) process. 



4.1 Time Series features 

Here we very briefiy summarize the 11 time series features 
used in our previous work (Kim et al.|2011bL 



• Nabove,Nbeiow'- Is the uumber of points above/below the 
upper/lower bound line calculated as points that are ±4(j 
over the average of the autocorrelation functions. 

• Stetson KaC- Is the variability index derived based 



on the autocorrelation function of each lightcurve ( Stetson 
T996|. 

• Rca' Is the range of the cumulative sums (starting from 



1 to the number of observations) of each lightcurve ( Ellaway 
[T978| ). 

• a/fh: The ratio of the standard deviation, a, to the 
mean magnitude, m. 

• Period and Period S/N: Using Lomb-Scargle algorithm 
(Lomb 1976; Scargle 1982 1 we used the period with the high- 



est value in the periodiogram along with the signal to noise 
of the best period. 

• Stetson L: Is a variability index (Stetson 19961 that 



describes the synchronous variability of different bands. 

• rj: Is the ratio of the mean of the square of successive 
differences to the variance of data points. 

• B — R: Average color for each lightcurve 

• Con: Is the number of three consecutive data points 
that are brighter or fainter than 2a and normalized the num- 
ber by TV - 2. 



4.2 Continuous Auto Regressive Process Features 

We use continuous time auto regressive model (CAR(l)) to 
model irregular sampled time series in MACHO and EROS- 
2 lightcurves. CAR(l) process has three parameters, it pro- 
vides a natural and consistent way of estimating a character- 
istic time scale and variance of lightcurves. CAR(l) process 
is described be the following stochastic differential equation 
( Brockwell fc Davis[[2002| 



dX{t) = --X{t)dt + oc^te{t) + bdt, (1) 

r 

for r, ac , t > 
where the mean value of the lightcurve X{t) is br and the 

2 

variance is ■^-^- r is the relaxation time of the process X{t), 
it can be interpreted as describing the variability amplitude 
of the time series, ac can be interpreted as describing the 
variability of the time series on time scales shorter than r. 



Quasar detection method m EROS-2 and MACHO LMC 5 



e(t) is a white noise process with zero mean and variance 
equal to one. The likelihood function of a CAR(l) model for 
a lightcurve with observations x = {x\, . . . ,Xn} observed 
at times {ti,...,i„} with measurement error variances 
,51} is: 



{<5f,. 



p(x|6,crc,r) = 
Xo = 

Xi = 



n 



1 {Xi -X*) 



rac 



aiXi^i + 



^^-l+51_-, 



f2o(l - til) 



+aifii-i 1 



fli-i 



!^»-i+<5f_i 



-(^^-^i-l)/T 



Xi — br (3) 

(4) 

(5) 



(x*_i+4i-i) (6) 



(7) 
(8) 





To find the optimal parameters we maximize the like- 
lihood with respect to ac, b and r. Given that the likeli- 
hood does not have an analytical solution, we can solve it 
with a statistical sampling method such Metropolis Hastings 
I Metropolis et al. 111953 1. Given that we extract features for 



all the lightcurves in EROS-2 and MACHO datasets (about 
28 and 40 millions of stars respectively), performing a statis- 
tical sampling process to determine the optimal parameters 
would be feasible only in cases where stable solutions are 
found in a reasonable amount of time. We consider that less 
than 3 seconds is reasonable given our hardware resources. 
Unfortunately we could not get stable solutions considering 
that restriction. To overcome this situation we simplify the 
optimization problem by reducing the number of parame- 
ters to be estimated. Instead of estimating ac, b and r, we 
just estimate ac and r and then we calculate b as the mean 
magnitude of the lightcurve divided by r. To check that this 
estimation works well, we use a sample of 250 lightcurves 
and compare the reduced Chi-square error using two and 
three parameters optimization, getting differences smaller 
than 2.5% in average. 

This approximation allows us to perform a two dimen- 
sional optimization which can be solved with a regular nu- 
merical method in less than one second per lightcurve. We 
used the Nelder-Mead multidimensional unconstrained non- 
linear optimization ( Nelder fc Mead | |1965 I to find the opti- 
mal parameters. Figure 111 shows the fitting of three quasar 
lightcurves with the resulting CAR(l) coefficients using the 
Nelder-Mead algorithm. Note that instead of using b directly 
as a feature, we use the mean magnitude of the lightcurve 
(m), in order to have a cleaner feature (6 is calculated from 
r, which is already used as a feature). 




X 




Figure 1. Quasar lightcurves (red circles) fitted with optimal 
CAR(l) model (gray lines) using Nelder-Mead. 



5 QSO CANDIDATES ON EROS-2 AND 
MACHO DATASETS 

5.1 EROS-2 dataset 

To train a model able to find quasars in EROS-2 we create 
a training set composed of 65 known quasars, 67 Be stars. 
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330 Long Periodic stars, 5829 non-variable stars, 1727 RR 
Lyrae, 406 Cepheids, and 488 EB stars. We get these stars 
cross-matching the EROS-2 dataset with MACHO known 
stars using positional matching with 3 arcsec of accuracy. 
We extracted features in bands R and B. Figures [2] and 
[3] show projections of the training set on different sets of 
features containing CAR(l) features. In many cases is easy 
to get a natural separation between quasars and the variable 
stars, but usually quasars overlap many of the non variable 
stars (ex. ac with B — R, crc with r, m with r). Fortunately, 
there are many projections where quasars and non- variable 
stars are mostly separated, (ex. ac with Con, ac with m, 
ac with Stetson Kac , t with Res, t with Stetson Kac-) 

To compare the distribution of the objects predicted as 
quasars with the training quasars and other variable stars, 
we plot our EROS-2 training data plus the predicted quasars 
projected on many different pairs of features (Figs. El , [5|. 
We can see that in most of the cases Predicted quasars and 
Training quasars have very similar distributions regardless 
of the small amount of training quasars we use. Main dif- 
ferences between both distributions are in general because 
of the big difference in size comparing training and testing 
data, resulting in a set of predicted quasars ~ 20 times big- 
ger than the training quasars set. 

To get an indicator of the accuracy in the training 
set on EROS-2 dataset, we run a 10 fold cross validation. 
This validation method consists of partitioning the dataset 
in 10 folds (subsets) of the same size, we iterate 10 times, 
on iteration k we train the classifier with all the folds but 
the fold k, then we test the performance on the fold k (the 
one which the model did not see during the training). The 
process returns the model prediction for the entire dataset 
(the union of the 10 testing folds is equal to the data). 
We measure the accuracy using the F-score indicator. This 
indicator is calculated as the harmonic mean of precision 
and recall: 

F-Score = 2 x ^'■'^"■''""xrecan 

prec^s^on-^-recaU 

Where precision and recall are defined as: 



precision — 



tp 



tp+fp 



recall = — ^ 



tp + fn 



Where tp, fp and fn are the number of true positives, 
false positives and false negatives respectively. 

Table [1] show the results for the boosted version of Ran- 
dom Forest, regular Random Forest and SVM (classifier used 
in our previous work ( |Kim et al.|2011b I ) with and without 
CAR features. 

We find 1160 candidates in the EROS-2 dataset. To 
validate our candidates we crossmatch them with the list of 



663 MACHO strong candidates in Kim et al. (2011b I. From 



that list, only 332 objects exists in EROS-2 dataset, we find 
191 matches between our EROS-2 candidates and those 332 
objects. Figure [6] shows some of the lightcurves of the quasar 
candidates for the EROS-2 dataset. 

Regarding the efficiency in the extraction of the CAR(l) 
features and the time series features, we implemented par- 
allel processing in order to perform the features extraction 
and classification in a reasonable amount of time. EROS-2 
and MACHO databases are stored as a set of thousands of 



Table 1. F-score for the EROS-2 training set using 10- fold cross 
validation for different classifieation models. Each classifiers is 
tuned with the optimal set of parameters. We can see that the 
boosted version of Random Forest with CAR features outper- 
forms other classification models. In all cases using CAR features 
improves the result of the corresponding classifier. 

SVM SVM RE RF AB+RE AB-I-RF 
No CAR CAR No CAR CAR No CAR CAR 



0.74 



0.855 



0.787 



0.813 



0.81 



0.868 



folders where each folder contains thousands of lightcurves 
of a given field. The feature extraction process runs as a 
set of parallel threads that run over different compressed 
files at the same time, extracting them and processing the 
lightcurves to get the features. Once the features are calcu- 
lated they are written into a common file related to a partic- 
ular folder, so each compressed file has a corresponding data 
file that stores the feature values of all the lightcurves within 
the folder. After the feature extraction process we run a clas- 
sification process that runs in parallel over the thousands of 
data feature files calculated in the previous step. 



5.2 MACHO dataset 

MACHO was a survey which observed the sky starting in 
July 1992 and ending in 1999 to detect microlensing events 
produced by Milky Way halo objects. Several tens of mil- 
lions of stars where observed in the Large Magellanic Cloud 
(LMC), Smal l Magellanic Cloud (SMC) and Galactic bulge 
( Alcock|2000l ). 

For the MACHO dataset we built a training set com- 
posed of 3969 non-variable stars, 127 Be stars, 78 Cepheids, 
193 eclipsing binaries, 288 RR Lyrae, 574 microlensing, 359 
long-period variables, and 58 quasars. We get the variable 
stars fronr the list of known MACHO variable sou rces ex- 
tract ed from SIMBAD s MACHO variable catalog ( jAlcock 
2001 1 and also from several other literature sources ( Alcock 



1997a|b| |Wood|[2000l [Keller et al.l[2002l |Thomas|[2005| ). To 
get the non variable stars, we randomly chose a subset of 
MACHO lightcurves from a few MACHO LMC fields and 
removed all the known MACHO variables from the subset. 

Each lightcurve is described as a feature vector which 
contains 28 features, 14 features for band B and 14 features 
for band R as described in section |4] 

Figures W\ and [S] show the training set projected on 
a two variables feature space. We can see that ac and r 
features show separations between two groups of classes: 
i) non- variables, Cepheid and Eclipsing Binaries stars and 
ii) quasars, Microlensings, LPVs and Be stars. Combining 
m and r we can see a cluster of quasars, which overlaps 
with some of the Be stars, non-variables, Microlensing and 
long period variables, but separates very well quasars from 
Cepheids, Eclipsing Binaries stars and most of the non- 
variables. Projecting on ac and m we can see that quasars 
separates from LPVs, Cepheids, Eclipsing Binaries, most of 
Be stars, most of the Microlensings and most of the non- 
variables. The biggest overlap is with Microlensings. 



http://vizier. u-strasbg.fr/viz-bin/ VizieR?-source=II/247 
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Figure 2. Projections on diiTerent pairs of CAR(l) features for EROS-2 training data 



Table 2. F-Score for the MACHO training set using 10-fold cross 
validation for different classification models. Eacli classifiers is 
tuned with the optimal set of parameters. We can see that the 
boosted version of Random Forest with CAR features outper- 
forms other classification models. In all cases using CAR features 
improves the result of the corresponding classifier. 



SVM 
No CAR 


SVM 
CAR 


RF 
No CAR 


RF 
CAR 


AB+RF 
No CAR 


AB+RF 
CAR 


0.787 


0.824 


0.826 


0.841 


0.844 


0.877 



By examining these projections we can see that quasars 
are clustered in high values of r, with higher values com- 
pared to Eclipsing Binaries, Cepheids and RR Lyraes. gc is 
very good to separate quasars from non- variables, also from 
Cepheids, RR Lyraes and Eclipsing Binaries stars, ac is not 
a good feature to separate quasars from Microlensings, Be 
stars and LPVs, but combining ac with B — R we get a 
strong separation between them. 

Table[2]shows comparative results among different clas- 
sification models. We included a Support Vector Machine, 
Random Forest and Radom Forest Boosted with AdaBoost. 
On each case the classifier is tuned with the optimal set of 
parameters. 

After we select and fit the model to the training set, 
we run on the whole MACHO data (about 40 million of 



lightcurves) , from where we get 2551 quasar candidates. We 
crossmatch our candidates with the 2566 and 663 strong 
candidates in our previous work ( Kim et al.||2011b l getting 
1148 and 494 matches respectively. 

Figure [9] shows some of the new candidates we find that 
are not in the previous list for MACHO candidates in |Kim| 
[eFaL] ( [20TTbl ) 

There are some cases where the model confuses a pe- 
riodic star with a quasar. Figure [lO] shows one example of 
this case. 

To analyze the distribution of predicted quasars in the 
feature space we show some projections of the training data 
plus the predicted quasars. Figures ITT] and [12] show the dis- 
tribution of predicted quasars , training quasars and all the 
other classes of stars. As in the EROS-2 case, we can see that 
in many cases the predicted quasars show similar distribu- 
tions compared with training quasars. There are some cases 
where a big portion of the predicted quasars is expanded out 
of the concentrated cluster of training quasars, for example, 
combining oc and B — R 



6 SUMMARY 

In this work we present a new list of candidate quasars from 
MACHO and EROS-2 datasets. This new list is obtained 
using a new model that uses continuous auto correlation 
features plus time series features to feed a boosted version 
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Figure 3. Projections on different pairs of features, combining CAR{1) features with time series features for EROS-2 training data 



of the Random Forest classifier ( Breinianl|2001 1. With this 
model we obtain a list of 1160 candidates for the EROS- 
2 and 2551 candidates for the MACHO dataset. From our 
MACHO candidates we crossmatch them with the old list of 



candidates from Kim et al. ( 2011b I and we get 1148 matches 



We crossmatch our EROS-2 candidates with the list of 663 



MACHO strong candidates in Kim et al. ( 2011b \. From that 
list, only 332 objects exist in the EROS-2 dataset, and we 
find 131 matches between our EROS-2 candidates and those 
332 objects (see table|3|. We prove that using boosted Ran- 
dom Forest with CAR(l) features we improve the fitting of 



the model to the training set in both EROS-2 and MACHO 
datasets. 

We show that quasars are well separated from many 
other kind of variable stars using CAR(l) features com- 
bined with time series features. We also proved that adding 
CAR(l) features, SVM, Random Forest and Boosted Ran- 
dom Forest improve their training accuracy. There are some 
challenges to overcome in future work such as the confusion 
of some periodic stars with quasars. We notice that about 
25% of false positives correspond to periodic stars. We be- 
lieve that adding a dedicated module to filter periodic stars 
we can improve the results. 
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Figure 4. Predicted quasars and Training stars distributions projected on different pairs of CAR(l) features for EROS-2 data. 



Previous Candidates 
MACHO (Afi) 


Previous Strong 
Candidates MACHO (M2) 


New list of MACHO 
Candidates (M3) 


List of EROS-2 
Candidates (E\) 


2566 


663 


2551 


1160 


Matches between 
{hh) and (Ml) 


Matches between 
(Ms) and (Af2) 


Objects from (M2) 
Catalogued in EROS-2 


Matches between 
(ME) and (Ei) 


1148 


491 


332 (ME) 


131 



Table 3. Table summarizing crossmatching results between different lists of quasars candidates 
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Figure 7. Training set projected on different pairs of CAR(l) features for MACHO data. 
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Figure 8. Projections on different pairs of features, combining CAR(l) features with time series features for MACHO training data 
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Figure 9. Lightcurves of new quasars candidates predicted from MACHO dataset 
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Figure 10. Lightcurvc of a wrongly predicted quasar in MACHO dataset 
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Figure 11. Predicted quasars and training stars distributions projected on different pairs of CAR(l) features for MACHO data. 
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Figure 12. Predicted quasars and Training stars distributions for ac and r features combined with three time series features for MACHO 
data. 



